Data Science Concepts and Analysis

Announcements

  • Homework 1 posted, due January 27
    • Can work with and submit with 1 partner
  • Office Hours:
    • Utso: Friday, 3-5pm
    • Yaxuan:
      • Wednesday, 11am-12pm (Room 113, Building 434)
      • Friday, 11am-12pm (Zoom)
    • Yuxi: Monday, 11am-12pm (PHELP 2524)
    • Franks: Thursday, 1-2pm (Zoom: https://tinyurl.com/stat100oh)

See Canvas for details

This week

  • Tabular data
    • Many ways to structure a dataset
    • Few organizational constraints ‘in the wild’
  • Principles of tidy data: matching semantics with structure
    • Data semantics: observations and variables
    • Tabular structure: rows and columns
    • The tidy standard
    • Common messes
    • Tidying operations
  • Transforming data frames
    • Subsetting (slicing and filtering)
    • Derived variables
    • Aggregation and summary statistics

Tabular data

  • Many possible layouts for tabular data
  • ‘Real’ datasets have few organizational constraints

Most data are stored in tables, but there are always multiple possible tabular layouts for the same underlying data.

Reading data

mammal_data <- read_csv("data/allison1976.csv")
mammal_data
# A tibble: 62 × 11
   species          body_wt brain_wt non_dreaming dreaming total_sleep life_span
   <chr>              <dbl>    <dbl>        <dbl>    <dbl>       <dbl>     <dbl>
 1 Africanelephant  6.65e+3   5712           NA       NA           3.3      38.6
 2 Africangiantpou… 1   e+0      6.6          6.3      2           8.3       4.5
 3 ArcticFox        3.38e+0     44.5         NA       NA          12.5      14  
 4 Arcticgroundsqu… 9.2 e-1      5.7         NA       NA          16.5      NA  
 5 Asianelephant    2.55e+3   4603            2.1      1.8         3.9      69  
 6 Baboon           1.06e+1    180.           9.1      0.7         9.8      27  
 7 Bigbrownbat      2.3 e-2      0.3         15.8      3.9        19.7      19  
 8 Braziliantapir   1.6 e+2    169            5.2      1           6.2      30.4
 9 Cat              3.3 e+0     25.6         10.9      3.6        14.5      28  
10 Chimpanzee       5.22e+1    440            8.3      1.4         9.7      50  
# ℹ 52 more rows
# ℹ 4 more variables: gestation <dbl>, predation <dbl>, exposure <dbl>,
#   danger <dbl>

Piping

  • Pipes chain a sequence of operations into a single, readable expression.

  • The pipe operator takes the output of one function and passes it as the first argument to the next function.

  • Written %>% (magrittr/tidyverse) or |> (base R, version 4.1+).

  • There are subtle differences between the two. These slides stick to |>.
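A minimal sketch of how the pipe rewrites a call, plus one of the subtle differences mentioned above (the placeholder comparison assumes R >= 4.2, where `_` was introduced):

```r
# Both pipes pass the left-hand side as the first argument:
c(4, 9, 16) |> sqrt() |> sum()   # same as sum(sqrt(c(4, 9, 16)))

# One subtle difference: magrittr's %>% can place the input anywhere
# with '.', while base R's |> uses '_' (R >= 4.2) and requires a
# named argument:
# mtcars %>% lm(mpg ~ wt, data = .)   # magrittr placeholder
# mtcars |> lm(mpg ~ wt, data = _)    # base-R placeholder
```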

Piping

## Piping approach
mammal_data |> 
  filter(startsWith(species, "African")) |> 
  select(species, body_wt, brain_wt)
# A tibble: 2 × 3
  species                body_wt brain_wt
  <chr>                    <dbl>    <dbl>
1 Africanelephant           6654   5712  
2 Africangiantpouchedrat       1      6.6

is the same as…

## same as
filtered_data <- filter(.data=mammal_data, 
                        startsWith(species, "African"))
select(filtered_data, species, body_wt, brain_wt)
# A tibble: 2 × 3
  species                body_wt brain_wt
  <chr>                    <dbl>    <dbl>
1 Africanelephant           6654   5712  
2 Africangiantpouchedrat       1      6.6

is the same as…

## Same as...
mammal_data[startsWith(mammal_data$species, "African"), c("species", "body_wt", "brain_wt")]
# A tibble: 2 × 3
  species                body_wt brain_wt
  <chr>                    <dbl>    <dbl>
1 Africanelephant           6654   5712  
2 Africangiantpouchedrat       1      6.6

Why tidy?

  • Many possible layouts for tabular data

  • ‘Real’ datasets have few organizational constraints

Most data are stored in tables, but there are always multiple possible tabular layouts for the same underlying data.

Mammal data: long layouts

Below is the Allison 1976 mammal brain-body weight dataset from the week 1 lecture, shown first in its original layout and then in longer and wider layouts:

# import brain and body weights
mammal1 <- read_csv('data/allison1976.csv') |> select(1:3)
mammal1
# A tibble: 62 × 3
   species                 body_wt brain_wt
   <chr>                     <dbl>    <dbl>
 1 Africanelephant        6654       5712  
 2 Africangiantpouchedrat    1          6.6
 3 ArcticFox                 3.38      44.5
 4 Arcticgroundsquirrel      0.92       5.7
 5 Asianelephant          2547       4603  
 6 Baboon                   10.6      180. 
 7 Bigbrownbat               0.023      0.3
 8 Braziliantapir          160        169  
 9 Cat                       3.3       25.6
10 Chimpanzee               52.2      440  
# ℹ 52 more rows

Longer layout

# A tibble: 5 × 3
  species                measurement  weight
  <chr>                  <chr>         <dbl>
1 Africanelephant        body_wt     6654   
2 Africanelephant        brain_wt    5712   
3 Africangiantpouchedrat body_wt        1   
4 Africangiantpouchedrat brain_wt       6.6 
5 ArcticFox              body_wt        3.38

Wide format

# A tibble: 2 × 63
  measurement Africanelephant Africangiantpouchedrat ArcticFox
  <chr>                 <dbl>                  <dbl>     <dbl>
1 body_wt                6654                    1        3.38
2 brain_wt               5712                    6.6     44.5 
# ℹ 59 more variables: Arcticgroundsquirrel <dbl>, Asianelephant <dbl>,
#   Baboon <dbl>, Bigbrownbat <dbl>, Braziliantapir <dbl>, Cat <dbl>,
#   Chimpanzee <dbl>, Chinchilla <dbl>, Cow <dbl>, Deserthedgehog <dbl>,
#   Donkey <dbl>, EasternAmericanmole <dbl>, Echidna <dbl>,
#   Europeanhedgehog <dbl>, Galago <dbl>, Genet <dbl>, Giantarmadillo <dbl>,
#   Giraffe <dbl>, Goat <dbl>, Goldenhamster <dbl>, Gorilla <dbl>,
#   Grayseal <dbl>, Graywolf <dbl>, Groundsquirrel <dbl>, Guineapig <dbl>, …

GDP growth data: wide layout

Here’s another example: World Bank data on annual GDP growth for 264 countries, 1961-2019. The raw layout is shown below.

gdp1 <- read_csv('data/annual_growth.csv')
head(gdp1)
# A tibble: 6 × 61
  `Country Name` `Country Code` `1961` `1962` `1963` `1964` `1965` `1966` `1967`
  <chr>          <chr>           <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1 Aruba          ABW                NA     NA     NA     NA     NA     NA     NA
2 Afghanistan    AFG                NA     NA     NA     NA     NA     NA     NA
3 Angola         AGO                NA     NA     NA     NA     NA     NA     NA
4 Albania        ALB                NA     NA     NA     NA     NA     NA     NA
5 Andorra        AND                NA     NA     NA     NA     NA     NA     NA
6 Arab World     ARB                NA     NA     NA     NA     NA     NA     NA
# ℹ 52 more variables: `1968` <dbl>, `1969` <dbl>, `1970` <dbl>, `1971` <dbl>,
#   `1972` <dbl>, `1973` <dbl>, `1974` <dbl>, `1975` <dbl>, `1976` <dbl>,
#   `1977` <dbl>, `1978` <dbl>, `1979` <dbl>, `1980` <dbl>, `1981` <dbl>,
#   `1982` <dbl>, `1983` <dbl>, `1984` <dbl>, `1985` <dbl>, `1986` <dbl>,
#   `1987` <dbl>, `1988` <dbl>, `1989` <dbl>, `1990` <dbl>, `1991` <dbl>,
#   `1992` <dbl>, `1993` <dbl>, `1994` <dbl>, `1995` <dbl>, `1996` <dbl>,
#   `1997` <dbl>, `1998` <dbl>, `1999` <dbl>, `2000` <dbl>, `2001` <dbl>, …

GDP growth data: long layout

Here’s an alternative layout for the annual GDP growth data:

# A tibble: 5 × 4
  `Country Name` `Country Code` year    GDP
  <chr>          <chr>          <chr> <dbl>
1 Afghanistan    AFG            1961   NA  
2 Albania        ALB            1961   NA  
3 Algeria        DZA            1961  -13.6
4 American Samoa ASM            1961   NA  
5 Andorra        AND            1961   NA  

SB weather data: long layouts

A third example: daily minimum and maximum temperatures recorded at Santa Barbara Municipal Airport from January 2021 through March 2021.

weather1 <- read_csv('data/sb_weather.csv')
# DATE is read in as character; it is converted to a date below
weather1
# A tibble: 88 × 5
   STATION     NAME                                   DATE       TMAX  TMIN
   <chr>       <chr>                                  <chr>     <dbl> <dbl>
 1 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, CA US 1/1/2021     65    37
 2 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, CA US 1/2/2021     62    38
 3 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, CA US 1/3/2021     60    42
 4 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, CA US 1/4/2021     72    43
 5 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, CA US 1/5/2021     61    40
 6 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, CA US 1/6/2021     71    39
 7 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, CA US 1/7/2021     73    38
 8 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, CA US 1/8/2021     79    36
 9 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, CA US 1/9/2021     71    39
10 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, CA US 1/10/2021    67    37
# ℹ 78 more rows
weather1 <- weather1 |> mutate(DATE = mdy(DATE)) ## from the lubridate package
weather1
# A tibble: 88 × 5
   STATION     NAME                                   DATE        TMAX  TMIN
   <chr>       <chr>                                  <date>     <dbl> <dbl>
 1 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, CA US 2021-01-01    65    37
 2 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, CA US 2021-01-02    62    38
 3 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, CA US 2021-01-03    60    42
 4 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, CA US 2021-01-04    72    43
 5 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, CA US 2021-01-05    61    40
 6 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, CA US 2021-01-06    71    39
 7 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, CA US 2021-01-07    73    38
 8 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, CA US 2021-01-08    79    36
 9 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, CA US 2021-01-09    71    39
10 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, CA US 2021-01-10    67    37
# ℹ 78 more rows

SB weather data: long layouts

# A tibble: 88 × 5
     DAY MONTH  YEAR  TMAX  TMIN
   <int> <dbl> <dbl> <dbl> <dbl>
 1     1     1  2021    65    37
 2     2     1  2021    62    38
 3     3     1  2021    60    42
 4     4     1  2021    72    43
 5     5     1  2021    61    40
 6     6     1  2021    71    39
 7     7     1  2021    73    38
 8     8     1  2021    79    36
 9     9     1  2021    71    39
10    10     1  2021    67    37
# ℹ 78 more rows
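The DAY/MONTH/YEAR columns in this layout can be derived from the parsed DATE column with lubridate accessors. A sketch on a two-row stand-in for weather1 (the code actually used to produce this layout isn't shown in the slides):

```r
library(dplyr)
library(lubridate)

# Two rows standing in for weather1 after the mdy() conversion:
demo <- tibble::tibble(
  DATE = as.Date(c("2021-01-01", "2021-01-02")),
  TMAX = c(65, 62),
  TMIN = c(37, 38)
)

# Pull the day, month, and year out of each date:
demo |>
  mutate(DAY = day(DATE), MONTH = month(DATE), YEAR = year(DATE)) |>
  select(DAY, MONTH, YEAR, TMAX, TMIN)
```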

SB weather data: wide days

# A tibble: 6 × 36
  STATION     NAME   MONTH  YEAR Type    `1`   `2`   `3`   `4`   `5`   `6`   `7`
  <chr>       <chr>  <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 USW00023190 SANTA…     1  2021 TMAX     65    62    60    72    61    71    73
2 USW00023190 SANTA…     1  2021 TMIN     37    38    42    43    40    39    38
3 USW00023190 SANTA…     2  2021 TMAX     66    67    69    63    66    68    60
4 USW00023190 SANTA…     2  2021 TMIN     45    40    44    37    38    38    38
5 USW00023190 SANTA…     3  2021 TMAX     68    66    59    62    67    69    60
6 USW00023190 SANTA…     3  2021 TMIN     37    36    36    37    39    43    47
# ℹ 24 more variables: `8` <dbl>, `9` <dbl>, `10` <dbl>, `11` <dbl>,
#   `12` <dbl>, `13` <dbl>, `14` <dbl>, `15` <dbl>, `16` <dbl>, `17` <dbl>,
#   `18` <dbl>, `19` <dbl>, `20` <dbl>, `21` <dbl>, `22` <dbl>, `23` <dbl>,
#   `24` <dbl>, `25` <dbl>, `26` <dbl>, `27` <dbl>, `28` <dbl>, `29` <dbl>,
#   `30` <dbl>, `31` <dbl>

UN development data: multiple tables

A final example: United Nations country development data organized into different tables according to variable type.

undev1 <- read_csv('data/hdi3.csv', na = '..') |> select(-hdi_rank)

undev2 <- read_csv('data/hdi2.csv', na = '..') |> 
    select(-one_of(c('hdi_rank', 'maternal_mortality')))

Here is a table of population measurements:

head(undev1, n=3)
# A tibble: 3 × 6
  country     total_pop urban_pct_pop pop_under5 pop_15to64 pop_over65
  <chr>           <dbl>         <dbl>      <dbl>      <dbl>      <dbl>
1 Norway            5.4          82.6        0.3        3.5        0.9
2 Ireland           4.9          63.4        0.3        3.2        0.7
3 Switzerland       8.6          73.8        0.4        5.7        1.6

And here is a table of a few gender-related variables:

head(undev2, n=3)
# A tibble: 3 × 5
  country     gender_inequality parliament_pct_women labor_participation_women
  <chr>                   <dbl>                <dbl>                     <dbl>
1 Norway                  0.045                 40.8                      60.4
2 Ireland                 0.093                 24.3                      56  
3 Switzerland             0.025                 38.6                      62.9
# ℹ 1 more variable: labor_participation_men <dbl>

UN development data: one table

Here are both tables merged by country:

# A tibble: 189 × 10
   country              total_pop urban_pct_pop pop_under5 pop_15to64 pop_over65
   <chr>                    <dbl>         <dbl>      <dbl>      <dbl>      <dbl>
 1 Norway                     5.4          82.6        0.3        3.5        0.9
 2 Ireland                    4.9          63.4        0.3        3.2        0.7
 3 Switzerland                8.6          73.8        0.4        5.7        1.6
 4 Hong Kong, China (S…       7.4         100          0.3        5.2        1.3
 5 Iceland                    0.3          93.9        0          0.2        0.1
 6 Germany                   83.5          77.4        4         54         18  
 7 Sweden                    10            87.7        0.6        6.2        2  
 8 Australia                 25.2          86.1        1.7       16.3        4  
 9 Netherlands               17.1          91.9        0.9       11          3.4
10 Denmark                    5.8          88          0.3        3.7        1.2
# ℹ 179 more rows
# ℹ 4 more variables: gender_inequality <dbl>, parliament_pct_women <dbl>,
#   labor_participation_women <dbl>, labor_participation_men <dbl>
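The merged table above can be produced with a join. A sketch using toy two-row stand-ins for undev1 and undev2 (the actual merge code isn't shown in the slides):

```r
library(dplyr)

# Toy stand-ins for undev1 and undev2, sharing the key column 'country':
pop    <- tibble::tibble(country = c("Norway", "Ireland"),
                         total_pop = c(5.4, 4.9))
gender <- tibble::tibble(country = c("Norway", "Ireland"),
                         gender_inequality = c(0.045, 0.093))

# Keep all rows of 'pop', matching rows of 'gender' by country:
merged <- left_join(pop, gender, by = "country")
merged
```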

A better glimpse

glimpse(undev_combined1)
Rows: 189
Columns: 10
$ country                   <chr> "Norway", "Ireland", "Switzerland", "Hong Ko…
$ total_pop                 <dbl> 5.4, 4.9, 8.6, 7.4, 0.3, 83.5, 10.0, 25.2, 1…
$ urban_pct_pop             <dbl> 82.6, 63.4, 73.8, 100.0, 93.9, 77.4, 87.7, 8…
$ pop_under5                <dbl> 0.3, 0.3, 0.4, 0.3, 0.0, 4.0, 0.6, 1.7, 0.9,…
$ pop_15to64                <dbl> 3.5, 3.2, 5.7, 5.2, 0.2, 54.0, 6.2, 16.3, 11…
$ pop_over65                <dbl> 0.9, 0.7, 1.6, 1.3, 0.1, 18.0, 2.0, 4.0, 3.4…
$ gender_inequality         <dbl> 0.045, 0.093, 0.025, NA, 0.058, 0.084, 0.039…
$ parliament_pct_women      <dbl> 40.8, 24.3, 38.6, NA, 38.1, 31.6, 47.3, 36.6…
$ labor_participation_women <dbl> 60.4, 56.0, 62.9, 54.0, 70.8, 55.3, 61.4, 60…
$ labor_participation_men   <dbl> 67.2, 68.4, 73.8, 67.5, 79.2, 66.6, 67.8, 70…

UN development data: one (longer) table

And here is another arrangement of the merged table:

# A tibble: 3,780 × 5
   country gender_variable     gender_value population_variable population_value
   <chr>   <chr>                      <dbl> <chr>                          <dbl>
 1 Norway  gender_inequality          0.045 total_pop                        5.4
 2 Norway  gender_inequality          0.045 urban_pct_pop                   82.6
 3 Norway  gender_inequality          0.045 pop_under5                       0.3
 4 Norway  gender_inequality          0.045 pop_15to64                       3.5
 5 Norway  gender_inequality          0.045 pop_over65                       0.9
 6 Norway  parliament_pct_wom…       40.8   total_pop                        5.4
 7 Norway  parliament_pct_wom…       40.8   urban_pct_pop                   82.6
 8 Norway  parliament_pct_wom…       40.8   pop_under5                       0.3
 9 Norway  parliament_pct_wom…       40.8   pop_15to64                       3.5
10 Norway  parliament_pct_wom…       40.8   pop_over65                       0.9
# ℹ 3,770 more rows

Many layouts

Pause and reflect

Return to one of the examples and review the different layouts.

  • If you had to pick one layout, which would you choose?

  • Why would you choose that one?

    • Aesthetic preference?
    • “Just makes sense this way”?
    • “This way is better because…”?
  • Can you envision advantages or disadvantages to different layouts?

Few organizational constraints

It’s surprisingly difficult to articulate reasons why one layout might be preferable to another.

Possibly for this reason, most data are stored in a layout that made intuitive sense to someone responsible for data management or collection at some point in time.

  • Usually the choice of layout isn’t principled

  • Idiosyncratic: two people are likely to make different choices

As a result:

  • Few widely used conventions

  • Lots of variability ‘in the wild’

  • Datasets are often organized in bizarre ways

Consequences for the data scientist

Because of the wide range of possible layouts for a dataset, and the variety of choices that are made about how to store data, data scientists are constantly faced with determining how best to reorganize datasets in a way that facilitates exploration and analysis.

Broadly, this involves two interdependent choices:

  • Choice of representation: how to encode information.
    • Example: parse dates as ‘MM/DD/YYYY’ (one variable) or ‘MM’, ‘DD’, ‘YYYY’ (three variables)?
    • Example: use values 1, 2, 3 or ‘low’, ‘med’, ‘high’?
    • Example: name variables ‘question1’, ‘question2’, …, or ‘age’, ‘income’, …?
  • Choice of form: how to display information
    • Example: wide table or long table?
    • Example: one table or many?
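The 1/2/3 versus 'low'/'med'/'high' choice can be sketched with an ordered factor, which keeps both representations at once (an illustrative example, not from the slides):

```r
# The same information under two representations:
codes <- c(1, 3, 2)                      # numeric codes
fct   <- factor(c("low", "high", "med"),
                levels = c("low", "med", "high"),
                ordered = TRUE)          # labeled, ordered categories

# The ordered factor keeps labels and ordering, so comparisons work:
fct[1] < fct[2]   # low < high
as.integer(fct)   # recover the numeric codes
```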

Remedy: the tidy data standard

Instead of addressing these challenges – choice of form and representation – anew every single time, it is immensely helpful to have a set of organizational principles to standardize the process of rearranging data.

The tidy data standard is a principled way of organizing data values. It has two main advantages:

  1. Facilitates workflow by establishing a consistent dataset structure.
  2. Principles are designed to make transformation, exploration, visualization, and modeling easy.

This is a pretty intuitive idea. Many (most?) other things are easier when they’re thoughtfully organized.

Principles of tidy data

  • Tidy data matches semantics with structure
    • Data semantics: observations, variables, units
    • Tabular structure: rows and columns
    • The tidy data standard
    • Common messes
    • Tidying operations

Matching semantics with structure

“Tidying your data means storing it in a consistent form that matches the semantics of the dataset with the way it is stored. In brief, when your data is tidy, each column is a variable, and each row is an observation. Tidy data is important because the consistent structure lets you focus your struggle on questions about the data, not fighting to get the data into the right form for different functions.” Wickham and Grolemund, R for Data Science, 2017.

Matching semantics with structure

A dataset is a collection of values.

  • semantics: meaning
    • data semantics refers to the meaning of each value
  • structure: form
    • data structure refers to how values are arranged

The tidy standard: data semantics ↔ data structure

Semantics: units, variables, and observations

Data semantics refers to the meaning of values. To introduce some general vocabulary, each value is

  • an observation

  • of a variable

  • taken on a unit.

To be precise:

  • An observational unit is the entity measured.

  • A variable is an attribute measured on each unit.

  • An observation is a collection of measurements taken on one unit.

Identifying units, variables, and observations

Let’s do an example. In the GDP growth data:

Term                  Definition                     Example
Observational units   Entity measured                Countries
Variables             Attributes measured            Year, GDP growth
Observations          Set of measurements per unit   Annual records
# third record
gdp2[3,]
# A tibble: 1 × 4
  `Country Name` `Country Code` year    GDP
  <chr>          <chr>          <chr> <dbl>
1 Algeria        DZA            1961  -13.6

So, in the record above, -13.6 is the value of GDP growth (one variable) in 1961 (another variable), together forming an annual record (observation) for Algeria (the unit).

Identifying units, variables, and observations

In the weather data:

Term                 Definition                     Example
Observational unit   Entity measured                SB airport weather station (location)
Variables            Attributes measured            Min temp, max temp, date, station info
Observations         Set of measurements per unit   Daily records
# first record
weather1[1, ]
# A tibble: 1 × 5
  STATION     NAME                                   DATE        TMAX  TMIN
  <chr>       <chr>                                  <date>     <dbl> <dbl>
1 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, CA US 2021-01-01    65    37

For example: 65 degrees Fahrenheit is the maximum temperature (one variable) recorded on one day (one observation) at the SB airport weather station (the unit).

Data structure

Data structure refers to the form in which a dataset is stored.

In this context, that means a tabular arrangement of a dataset comprising:

  • rows

  • columns

As we saw, there are multiple structures available to represent any dataset.

The tidy standard

The tidy standard consists of matching semantics with structure.

We can now make that precise. A dataset conforming to the tidy standard is organized so that:

  1. Each variable is a column.
  2. Each observation is a row.
  3. Each table contains measurements on only one type of observational unit.

Tidy or messy?

Let’s revisit some of our examples of multiple layouts, starting with gdp1.

head(gdp1, n=3)
# A tibble: 3 × 61
  `Country Name` `Country Code` `1961` `1962` `1963` `1964` `1965` `1966` `1967`
  <chr>          <chr>           <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1 Aruba          ABW                NA     NA     NA     NA     NA     NA     NA
2 Afghanistan    AFG                NA     NA     NA     NA     NA     NA     NA
3 Angola         AGO                NA     NA     NA     NA     NA     NA     NA
# ℹ 52 more variables: `1968` <dbl>, `1969` <dbl>, `1970` <dbl>, `1971` <dbl>,
#   `1972` <dbl>, `1973` <dbl>, `1974` <dbl>, `1975` <dbl>, `1976` <dbl>,
#   `1977` <dbl>, `1978` <dbl>, `1979` <dbl>, `1980` <dbl>, `1981` <dbl>,
#   `1982` <dbl>, `1983` <dbl>, `1984` <dbl>, `1985` <dbl>, `1986` <dbl>,
#   `1987` <dbl>, `1988` <dbl>, `1989` <dbl>, `1990` <dbl>, `1991` <dbl>,
#   `1992` <dbl>, `1993` <dbl>, `1994` <dbl>, `1995` <dbl>, `1996` <dbl>,
#   `1997` <dbl>, `1998` <dbl>, `1999` <dbl>, `2000` <dbl>, `2001` <dbl>, …

Tidy or messy?

We can compare the semantics and structure for alignment:

Semantics
  Observations          Annual records
  Variables             GDP growth and year
  Observational units   Countries

Structure
  Rows      Countries
  Columns   Values of year
  Tables    Just one

Rules 1 and 2 are violated, since column names are values (of year), not variables. Not tidy.

Tidy or messy?

In gdp2:

head(gdp2, n=4)
# A tibble: 4 × 4
  `Country Name` `Country Code` year    GDP
  <chr>          <chr>          <chr> <dbl>
1 Afghanistan    AFG            1961   NA  
2 Albania        ALB            1961   NA  
3 Algeria        DZA            1961  -13.6
4 American Samoa ASM            1961   NA  

Tidy or messy?

Comparison of semantics and structure:

Semantics
  Observations          Annual records
  Variables             GDP growth and year
  Observational units   Countries

Structure
  Rows      Annual records
  Columns   GDP growth and year
  Tables    Just one

All three rules are met: rows are observations, columns are variables, and there’s one unit type and one table. Tidy.

Tidy or messy?

head(weather2, n=4)
# A tibble: 4 × 8
  STATION     NAME                      DATE        TMAX  TMIN   DAY MONTH  YEAR
  <chr>       <chr>                     <date>     <dbl> <dbl> <int> <dbl> <dbl>
1 USW00023190 SANTA BARBARA MUNICIPAL … 2021-01-01    65    37     1     1  2021
2 USW00023190 SANTA BARBARA MUNICIPAL … 2021-01-02    62    38     2     1  2021
3 USW00023190 SANTA BARBARA MUNICIPAL … 2021-01-03    60    42     3     1  2021
4 USW00023190 SANTA BARBARA MUNICIPAL … 2021-01-04    72    43     4     1  2021

Try this one on your own. Then compare with your neighbor.

  1. Identify the observations and variables
  2. What are the observational units?

Tidy or messy?

In undev1 and undev2:

head(undev1, n=3)
# A tibble: 3 × 6
  country     total_pop urban_pct_pop pop_under5 pop_15to64 pop_over65
  <chr>           <dbl>         <dbl>      <dbl>      <dbl>      <dbl>
1 Norway            5.4          82.6        0.3        3.5        0.9
2 Ireland           4.9          63.4        0.3        3.2        0.7
3 Switzerland       8.6          73.8        0.4        5.7        1.6
head(undev2, n=3)
# A tibble: 3 × 5
  country     gender_inequality parliament_pct_women labor_participation_women
  <chr>                   <dbl>                <dbl>                     <dbl>
1 Norway                  0.045                 40.8                      60.4
2 Ireland                 0.093                 24.3                      56  
3 Switzerland             0.025                 38.6                      62.9
# ℹ 1 more variable: labor_participation_men <dbl>

Here there are multiple tables. To discuss:

  • Are the observational units the same or different?
  • Based on your answer above, is the data tidy or not?

Common messes

These examples illustrate some of the most common kinds of messiness:

  • Columns are values, not variables
    • GDP data gdp1: columns are 1961, 1962, …
  • Multiple variables are stored in one column
    • Mammal data mammal2: weight column contains both body and brain weights
  • Variables or values are stored in rows and columns
    • Weather data weather3: day values are stored in column names, and each day column mixes min and max temperatures across rows
  • Measurements on one type of observational unit are divided into multiple tables.
    • UN development data: undev1 stores population variables; undev2 stores gender-related variables.

Tidying operations

These common messes can be cleaned up by some simple operations:

  • pivot_longer
    • reshape a dataframe from wide to long format
  • pivot_wider
    • reshape a dataframe from long to wide format
  • joins
    • combine two dataframes by matching rows on the values of shared key columns

pivot_longer

Pivoting to a longer format resolves the problem of having values stored as column names (common mess 1).

pivot_longer

To illustrate with gdp1:

head(gdp1, n=3)
# A tibble: 3 × 61
  `Country Name` `Country Code` `1961` `1962` `1963` `1964` `1965` `1966` `1967`
  <chr>          <chr>           <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1 Aruba          ABW                NA     NA     NA     NA     NA     NA     NA
2 Afghanistan    AFG                NA     NA     NA     NA     NA     NA     NA
3 Angola         AGO                NA     NA     NA     NA     NA     NA     NA
# ℹ 52 more variables: `1968` <dbl>, `1969` <dbl>, `1970` <dbl>, `1971` <dbl>,
#   `1972` <dbl>, `1973` <dbl>, `1974` <dbl>, `1975` <dbl>, `1976` <dbl>,
#   `1977` <dbl>, `1978` <dbl>, `1979` <dbl>, `1980` <dbl>, `1981` <dbl>,
#   `1982` <dbl>, `1983` <dbl>, `1984` <dbl>, `1985` <dbl>, `1986` <dbl>,
#   `1987` <dbl>, `1988` <dbl>, `1989` <dbl>, `1990` <dbl>, `1991` <dbl>,
#   `1992` <dbl>, `1993` <dbl>, `1994` <dbl>, `1995` <dbl>, `1996` <dbl>,
#   `1997` <dbl>, `1998` <dbl>, `1999` <dbl>, `2000` <dbl>, `2001` <dbl>, …
gdp1 |> pivot_longer(cols=3:61,
                     names_to="Year",
                     values_to="GDP") |> 
  arrange(Year, `Country Name`) # sort by year first then country
# A tibble: 15,576 × 4
   `Country Name`      `Country Code` Year     GDP
   <chr>               <chr>          <chr>  <dbl>
 1 Afghanistan         AFG            1961   NA   
 2 Albania             ALB            1961   NA   
 3 Algeria             DZA            1961  -13.6 
 4 American Samoa      ASM            1961   NA   
 5 Andorra             AND            1961   NA   
 6 Angola              AGO            1961   NA   
 7 Antigua and Barbuda ATG            1961   NA   
 8 Arab World          ARB            1961   NA   
 9 Argentina           ARG            1961    5.43
10 Armenia             ARM            1961   NA   
# ℹ 15,566 more rows
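One refinement: after the pivot above, Year is still a character column (it came from the column names). tidyr's names_transform argument can convert it during the pivot. A sketch on a toy two-country table (the 1962 values here are invented for illustration):

```r
library(dplyr)
library(tidyr)

# Toy wide table in the same shape as gdp1 (1962 values invented):
gdp_demo <- tibble::tibble(
  `Country Name` = c("Algeria", "Argentina"),
  `1961` = c(-13.6, 5.43),
  `1962` = c(-19.7, -0.85)
)

# names_transform converts the year column names from character
# to integer as they become values:
gdp_demo |>
  pivot_longer(cols = -`Country Name`,
               names_to = "Year",
               values_to = "GDP",
               names_transform = list(Year = as.integer))
```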

Pivoting to a wider format

Pivoting to a wider format resolves the issue of having multiple variables stored in one column (common mess 2).

Pivot wider

For example, the mammal2 layout can be put in tidier form with pivot_wider:

head(mammal2, n=3)
# A tibble: 3 × 3
  species                measurement weight
  <chr>                  <chr>        <dbl>
1 Africanelephant        body_wt       6654
2 Africanelephant        brain_wt      5712
3 Africangiantpouchedrat body_wt          1
mammal2 |> pivot_wider(
    names_from = 'measurement', # which variable(s) do you want to send to new column names?
    values_from = 'weight' # which variable(s) do you want to use to populate the new columns?
) 
# A tibble: 62 × 3
   species                 body_wt brain_wt
   <chr>                     <dbl>    <dbl>
 1 Africanelephant        6654       5712  
 2 Africangiantpouchedrat    1          6.6
 3 ArcticFox                 3.38      44.5
 4 Arcticgroundsquirrel      0.92       5.7
 5 Asianelephant          2547       4603  
 6 Baboon                   10.6      180. 
 7 Bigbrownbat               0.023      0.3
 8 Braziliantapir          160        169  
 9 Cat                       3.3       25.6
10 Chimpanzee               52.2      440  
# ℹ 52 more rows

Pivot longer and wider

Common mess 3 is a combination of messes 1 and 2: values or variables are stored in both rows and columns. Pivoting longer and then wider in sequence can usually fix this. weather3 illustrates the issue:

weather3
# A tibble: 6 × 36
  STATION     NAME   MONTH  YEAR Type    `1`   `2`   `3`   `4`   `5`   `6`   `7`
  <chr>       <chr>  <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 USW00023190 SANTA…     1  2021 TMAX     65    62    60    72    61    71    73
2 USW00023190 SANTA…     1  2021 TMIN     37    38    42    43    40    39    38
3 USW00023190 SANTA…     2  2021 TMAX     66    67    69    63    66    68    60
4 USW00023190 SANTA…     2  2021 TMIN     45    40    44    37    38    38    38
5 USW00023190 SANTA…     3  2021 TMAX     68    66    59    62    67    69    60
6 USW00023190 SANTA…     3  2021 TMIN     37    36    36    37    39    43    47
# ℹ 24 more variables: `8` <dbl>, `9` <dbl>, `10` <dbl>, `11` <dbl>,
#   `12` <dbl>, `13` <dbl>, `14` <dbl>, `15` <dbl>, `16` <dbl>, `17` <dbl>,
#   `18` <dbl>, `19` <dbl>, `20` <dbl>, `21` <dbl>, `22` <dbl>, `23` <dbl>,
#   `24` <dbl>, `25` <dbl>, `26` <dbl>, `27` <dbl>, `28` <dbl>, `29` <dbl>,
#   `30` <dbl>, `31` <dbl>

Pivot longer and wider

First Pivot Longer

## First move the day columns into a single variable
weather3_long <- weather3 |> 
  pivot_longer(cols = 6:36,
    names_to = 'day',
    values_to = 'temp'
  )
weather3_long
# A tibble: 186 × 7
   STATION     NAME                                MONTH  YEAR Type  day    temp
   <chr>       <chr>                               <dbl> <dbl> <chr> <chr> <dbl>
 1 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, C…     1  2021 TMAX  1        65
 2 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, C…     1  2021 TMAX  2        62
 3 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, C…     1  2021 TMAX  3        60
 4 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, C…     1  2021 TMAX  4        72
 5 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, C…     1  2021 TMAX  5        61
 6 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, C…     1  2021 TMAX  6        71
 7 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, C…     1  2021 TMAX  7        73
 8 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, C…     1  2021 TMAX  8        79
 9 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, C…     1  2021 TMAX  9        71
10 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, C…     1  2021 TMAX  10       67
# ℹ 176 more rows

Pivot longer and wider

Then pivot_wider

## Then spread the TMAX/TMIN rows into separate columns
weather3_tidy <- weather3_long |> 
  pivot_wider(names_from = Type, values_from = temp)
weather3_tidy
# A tibble: 93 × 7
   STATION     NAME                                MONTH  YEAR day    TMAX  TMIN
   <chr>       <chr>                               <dbl> <dbl> <chr> <dbl> <dbl>
 1 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, C…     1  2021 1        65    37
 2 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, C…     1  2021 2        62    38
 3 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, C…     1  2021 3        60    42
 4 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, C…     1  2021 4        72    43
 5 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, C…     1  2021 5        61    40
 6 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, C…     1  2021 6        71    39
 7 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, C…     1  2021 7        73    38
 8 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, C…     1  2021 8        79    36
 9 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, C…     1  2021 9        71    39
10 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, C…     1  2021 10       67    37
# ℹ 83 more rows

Pivot longer and wider

All in one chunk

## Pivot longer, then wider, in a single pipeline
weather3 |> 
  pivot_longer(cols = 6:36,
               names_to = 'day',
               values_to = 'temp'
               ) |> 
  pivot_wider(names_from = Type, 
              values_from = temp)
# A tibble: 93 × 7
   STATION     NAME                                MONTH  YEAR day    TMAX  TMIN
   <chr>       <chr>                               <dbl> <dbl> <chr> <dbl> <dbl>
 1 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, C…     1  2021 1        65    37
 2 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, C…     1  2021 2        62    38
 3 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, C…     1  2021 3        60    42
 4 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, C…     1  2021 4        72    43
 5 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, C…     1  2021 5        61    40
 6 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, C…     1  2021 6        71    39
 7 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, C…     1  2021 7        73    38
 8 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, C…     1  2021 8        79    36
 9 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, C…     1  2021 9        71    39
10 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, C…     1  2021 10       67    37
# ℹ 83 more rows

Pivoting

Joins

Joining resolves the problem of observations or variables on one unit type being split across multiple tables (mess 4). The basic idea is to combine tables by matching rows.

However, there are a number of different joining rules (corresponding to SQL joins).

Join

The code below combines the columns of the two tables by matching rows on country.

inner_join(undev1, undev2, by = 'country')
# A tibble: 189 × 10
   country              total_pop urban_pct_pop pop_under5 pop_15to64 pop_over65
   <chr>                    <dbl>         <dbl>      <dbl>      <dbl>      <dbl>
 1 Norway                     5.4          82.6        0.3        3.5        0.9
 2 Ireland                    4.9          63.4        0.3        3.2        0.7
 3 Switzerland                8.6          73.8        0.4        5.7        1.6
 4 Hong Kong, China (S…       7.4         100          0.3        5.2        1.3
 5 Iceland                    0.3          93.9        0          0.2        0.1
 6 Germany                   83.5          77.4        4         54         18  
 7 Sweden                    10            87.7        0.6        6.2        2  
 8 Australia                 25.2          86.1        1.7       16.3        4  
 9 Netherlands               17.1          91.9        0.9       11          3.4
10 Denmark                    5.8          88          0.3        3.7        1.2
# ℹ 179 more rows
# ℹ 4 more variables: gender_inequality <dbl>, parliament_pct_women <dbl>,
#   labor_participation_women <dbl>, labor_participation_men <dbl>

Joins

There are various rules for exactly which rows to keep after merging, but the general procedure for merging data frames df1 and df2 is the same.

  • Specify keys: the shared columns to use for matching rows of df1 with rows of df2.
    • for example, joining on date will align rows in df2 with rows of df1 that have the same value for date
  • Specify a rule for which rows to return after merging
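As a small sketch of key matching (with made-up tables and assuming dplyr is loaded), joining on a shared date key aligns rows wherever the key values agree:

```r
library(dplyr)

# hypothetical tables that share a 'date' key (values made up)
df1 <- tibble(date = c("2021-01-01", "2021-01-02"), tmax = c(65, 62))
df2 <- tibble(date = c("2021-01-02", "2021-01-03"), tmin = c(38, 42))

# rows are matched on equal values of the key column
inner_join(df1, df2, by = "date")
# only 2021-01-02 appears in both tables, so the result has one row
# with both tmax and tmin
```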

Inner join

Other joins

  • Inner join: keep only rows whose key appears in both tables
  • Left join: keep all rows from the left table
  • Right join: keep all rows from the right table
  • Outer join: keep all rows from both tables
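The four rules can be compared side by side on toy tables with partially overlapping keys (made-up data, assuming dplyr is loaded):

```r
library(dplyr)

# toy tables: key id = 2 appears in both, ids 1 and 3 in only one
left_tbl  <- tibble(id = c(1, 2), x = c("a", "b"))
right_tbl <- tibble(id = c(2, 3), y = c("c", "d"))

inner_join(left_tbl, right_tbl, by = "id")  # id 2 only
left_join(left_tbl, right_tbl, by = "id")   # ids 1 and 2; y is NA for id 1
right_join(left_tbl, right_tbl, by = "id")  # ids 2 and 3; x is NA for id 3
full_join(left_tbl, right_tbl, by = "id")   # ids 1, 2, and 3
```

Note that dplyr calls the outer join full_join().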

Tidying facilitates transformation

Why use the tidy standard? Wouldn’t any system of organization do just as well?

The tidy standard has three main advantages:

  1. Having a consistent system of organization makes it easier to focus on analysis and exploration.

  2. Transformation of tidy data is especially natural in most computing environments due to vectorized operations.

  3. Many tools for exploration, visualization, and modeling are designed to work with tidy data inputs.
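As a small illustration of point 2 (made-up values, assuming dplyr is loaded): because each row of a tidy table is one observation, a derived variable is plain vectorized column arithmetic, with no looping over rows.

```r
library(dplyr)

# a tidy mini-table: one row per day (values made up for illustration)
tidy_temps <- tibble(day = 1:3, TMAX = c(65, 62, 60), TMIN = c(37, 38, 42))

# the daily temperature range is a single vectorized operation
tidy_temps |> mutate(range = TMAX - TMIN)
# range column: 28, 24, 18
```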

Visualization

Plotting libraries are often designed so that tidy data is the easiest to plot.

weather1 |> ggplot(aes(x=DATE, y=TMAX)) + 
  geom_line()

Exceptions?

There will be situations where you’ll need to deviate from the tidy format for various purposes.

Sometimes, plotting and table construction require reshaping data into non-tidy formats, for example:

weather4 <- weather1 |> 
  pivot_longer(cols = c("TMIN", "TMAX"),
               values_to="Temp", 
               names_to="Min/Max")

weather4 |> ggplot(aes(x=DATE, y=Temp, col=`Min/Max`)) + 
  geom_line()

Exceptions?

mammal2 |> filter(species %in% c("Rabbit", "Raccoon", "Sheep")) |> 
  ggplot() + 
  geom_col(aes(y=weight, x=species, fill=measurement), 
           position="dodge") + 
  coord_flip()

Tidy data is not easy to look at

  • Tidy data is optimized for computation and visualization

  • Most data in a spreadsheet won’t be tidy!

  • When presenting tables, tidy is usually not optimal

Review: The Tidy Data Format

The tidy data format is defined by three key principles:

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a table.

The tidy format

Tidy data typically has these characteristics:

  • Each cell contains a single value.
  • Column headers are variable names, not values.
  • Variables are stored in columns, not rows.
  • Different types of observational units are stored in different tables.

Tidy Data Format

Advantages:

  • Consistent format
  • Easy visualization with ggplot2
  • Simplifies data manipulation.
    • Tools like dplyr and tidyr work efficiently with tidy data, making data transformation more straightforward.
  • Facilitates analysis
    • Many statistical and machine learning functions in R expect data in a tidy format.

Tidy Data Format

Disadvantages:

  • Learning curve
  • Storage inefficiency (long formats repeat key values across rows)
  • Not always intuitive
    • For some types of data or analyses, a wide format might be more natural
    • Not good for examining by eye (e.g. in a spreadsheet)

Review

  • In tidy data, rows and columns correspond to observations and variables.
    • This provides a standard dataset structure that facilitates exploration and analysis.
    • Many datasets are not stored in this format.
  • Transformation operations are a lot easier with tidy data.
    • Due in part to the way tools in dplyr and tidyr are designed.
    • The goal of lab 1 is to learn these operations.
  • There are situations where non-tidy data is useful.
    • In PSTAT100, these will usually arise in plotting and tabulation tasks.

Up next

  • Slicing rows and selecting columns
  • Filtering rows by logical conditions
  • Defining new variables from scratch or by operations on existing variables (mutate)
  • Aggregations (min, mean, max, etc.)
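A preview of these operations chained together, on a toy tidy table standing in for the weather data (values made up, assuming dplyr is loaded):

```r
library(dplyr)

# toy tidy table in the shape of weather3_tidy (values made up)
temps <- tibble(MONTH = c(1, 1, 2), day = c("1", "2", "1"),
                TMAX = c(65, 62, 66), TMIN = c(37, 38, 45))

temps |>
  filter(MONTH == 1) |>                 # filter rows by a logical condition
  mutate(range = TMAX - TMIN) |>        # define a new variable
  summarize(mean_range = mean(range))   # aggregate to a summary statistic
# mean_range: 26
```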